1 Predicting the Number of Solar Panels in the United States

Fall 2021

Athan Chan: Modeling, writeups, debugging, project consulting

Brendan Co: Data Cleaning, merging datasets, writeups, final project aesthetics

Ayden Salazar: Modeling, hyperparameter tuning, writeups, visualizations

As a group, we strongly believe all the work for this project was equally divided and completed. Therefore, we all feel like we should be graded equally for each part of the project.

1.1 Abstract (5 points)

The main inspiration for the project is our desire to combat climate change, as it poses a number of negative consequences for the planet such as extreme disasters (droughts, floods, and extreme temperatures), melting glaciers, and changing crop yields. With the recent Clean Electricity Performance Program under Biden, 150 billion dollars of federal funds have been invested to combat climate change and accelerate the development of sustainable energy. Our project focuses on predicting the number of solar panels in the United States at two different scopes. By being able to predict the number of solar panels, we can direct resources, policies, and funding to areas with fewer solar panels. This allocation will increase equal access to solar electricity and can support policy decisions.

Our first prediction problem is predicting the number of solar panel systems for each county in the United States using 3 datasets (DeepSolar, Tracking The Sun, US Zipcode to County State to FIPS). Our second prediction problem is predicting the number of solar panel systems for each county in California using 4 datasets (the 3 datasets from the first prediction problem plus SacBee). For both prediction problems, we use three prediction models: LASSO, RIDGE, and a neural network. In all cases, the neural network performed the best on the datasets. Even after tuning the hyperparameters for LASSO (test R-squared: 0.749) and RIDGE (test R-squared: 0.706), the neural network performed better (test R-squared: 0.798). On the training data, LASSO and RIDGE overfit less than the neural network did.

Through this project, we learned how to produce results with real-world data, and we also learned how difficult it can be to obtain the data needed for our prediction questions. Originally, we planned on using full FIPS codes, but since the granularities did not match across datasets, we changed the granularity to state-county FIPS codes (the first five digits of a FIPS code). Despite many twists and turns, we were able to find some meaningful results and determine which areas should be targeted (based on the visualization we produced at the end).

1.2 Project Background (5 points)

Climate change has been an ongoing issue. Its main driver is greenhouse gas emissions such as water vapor, nitrous oxide, carbon dioxide, and methane (NASA). These gases are released by the increased use of fossil fuels to power much of the technology we use today. As technology advances, the demand for energy will increase, and the planet cannot sustain any more carbon emissions. By 2050, we need to stop emitting carbon or there will be irreversible damage done to the environment.

The main consequences of climate change are that the earth will be warmer on average, some places will become warmer or wetter than others, a stronger greenhouse effect will melt glaciers and raise sea levels, and greenhouse gases may affect crop yields, along with other effects we do not yet know. Additionally, climate change causes climate extremes such as droughts, floods, and extreme temperatures. Research also suggests that climate change may decrease the nutritional value of most food crops by reducing the concentration of protein and minerals in most plant species. There is a 95 percent probability that human activity has warmed our planet, which means that the main cause of global warming is our daily emissions of greenhouse gases as humans living in a modern civilization.

To reduce fossil fuel consumption and combat climate change, there have been widespread efforts to adopt renewable energy such as wind energy and solar energy. In order to combat climate change, Biden released a Clean Electricity Performance Program, which invests 150 billion dollars into reducing emissions, creating new jobs, and growing the economy. His goal is to achieve 100% clean energy by 2035. He plans on using federal money to accelerate the development of sustainable energy (McIntyre and Murrow).

To achieve this difficult goal, Biden plans to invest in power companies that meet annual targets for clean energy. The program also plans to penalize power companies that fail to meet the energy targets set by the government. For our project, we wanted to focus on solar energy in particular and to find better ways to predict the number of solar panels in a region. With this information, solar panel marketers can target places that should get solar panels, since additional energy generated can be sold back to the power plant. We decided to use as many features as we could find to create a model that determines how many solar panels are in a county. By targeting places that have fewer solar panels, we can help deploy more renewable energy systems and move toward the global goal of zero carbon emissions by 2050.

“The Causes of Climate Change.” NASA, NASA, 30 Nov. 2021, climate.nasa.gov/causes

McIntyre, Yvonne, and Derek Murrow. “House Proposes Strong Clean Electricity Performance Program.” NRDC, 14 Sept. 2021, https://www.nrdc.org/experts/yvonne-mcintyre/house-proposes-strong-clean-electricity-performance-program.

1.3 Project Objective (5 points)

The purpose of our project is to predict how many solar systems are in each county in the United States, and in California with different datasets. We felt that being able to predict solar systems can help us understand which places have the most solar panels and which places might need to invest in solar panels. Having these answers will ultimately help us become a greener society, as we can rely less on fossil fuels.

Our second objective is to predict how many solar systems are in each county in California. After finding the SacBee dataset, we felt we could make a more accurate prediction model by including data about the political leaning of each county. It is important to note that information on political leaning is not found in either the DeepSolar or Tracking The Sun dataset.

The resource allocation problem is deciding which places may need more incentives to invest in solar panels. For example, if a county has no incentives and few solar panels but a lot of sunlight, we would want to increase the incentives for solar panel installation in that region through grants, rebates, and more.

1.4 Input Data Description (5 points)

1.4.1 Question 1:

DeepSolar

We obtained this data from Stanford’s DeepSolar project website (http://web.stanford.edu/group/deepsolar/home). The Stanford researchers use a convolutional neural network, Google Inception V3, which relies on ImageNet pretraining, to classify solar panels in a dataset containing 360K images. “Combining satellite imagery and deep learning, we aimed to develop a framework to automatically construct, maintain, and update the solar installation database and realize the next-level visibility on renewable energy deployment.” The data comes from the model generated by this framework.

SacBee

We obtained this data from the Sacramento Bee news outlet (https://www.sacbee.com/news/databases/article237132379.html), which sourced the dataset from the California Secretary of State’s Report of Registration from October 1, 2019.

Tracking the Sun

We obtained this dataset from data.gov (https://catalog.data.gov/dataset/tracking-the-sun). The Tracking the Sun report series is made by Berkeley Lab and summarizes installed prices and other trends among grid-connected, distributed solar photovoltaic (PV) systems in the United States. The data “derive primarily from project-level data reported to state agencies and utilities that administer PV incentive programs, solar renewable energy credit (SREC) registration systems, or interconnection processes” (https://github.com/openEDI/documentation/blob/main/TrackingtheSun.md).

US Zipcode to County State to FIPS

We obtained this dataset from data.world (https://data.world/niccolley/us-zipcode-to-county-state). This dataset was created to go between County - State Name, State-County FIPS, City, or to ZIP Code. The dataset was built on three data sources: US HUD, Census Bureau, and USPS Zip to City Lookup.

1.4.2 Question 2:

DeepSolar

Structure - how is the data stored? The data is stored as a csv. Each record is a unique FIPS code corresponding to a region, with estimated solar panel values for that region. There is no estimated data beyond these values. Attached is the DeepSolar metadata file, which explains all 169 columns.

Granularity - how are the data aggregated (summed, averaged, etc)? Each record is a FIPS code which corresponds to a region. All records capture the same granularity. Many of the values are rates over the region. If the data is aggregated a certain way, it is reflected in the column name (e.g., average_household_income).

Scope - how much time, how many people, what partial area? Our region of interest is California, but we will try our best to generalize to the entire United States. This dataset covers most FIPS regions, so it is likely that we can extrapolate to the entire United States. There are 3,168 counties represented in our dataset, out of 3,242 counties in the entire United States. Given the amount of data we have, the data is very likely to generalize to the entire population of the United States.

Temporality - how is time represented in the data? On the DeepSolar website, they note that the dataset will be updated to generate a time-history of solar installations. However, time is not represented in the data.

Faithfulness - is the data trustworthy? The database was constructed by a team of Stanford scientists, which suggests that the data is credible. The database was developed through an autonomous framework that constructs, maintains, and updates the solar installation information. Technical information on the solar panels was collected through satellite imagery and deep learning. However, it is important to note that columns unrelated to specific solar panel data (such as average_household_income, land_area, population_density) contain null values where no data was available. Along with its own dataset, DeepSolar also utilizes other sources (ACS 2015 5-year estimates, EIA 2015, NASA Surface Meteorology and Solar Energy, townhall.com, dsireusa.org, and theguardian.com) for certain column values.

SacBee

Structure - how is the data stored? The SacBee data is stored as a .csv file. Each row is a city in California. The columns contain information pertaining to that city's percentage of voters who identify as Democrat, Republican, etc.

Granularity - how are the data aggregated (summed, averaged, etc)? Each record represents a city in California. The data in the columns pertaining to the political parties is aggregated as a percentage, meaning that if one were to sum together Democrats + Republicans + Third Party, etc., one should get 100.

Scope - how much time, how many people, what partial area? The scope spans the state of California. Each record covers a city with X people, where X is the value of the "Registered" column.

Temporality - how is time represented in the data? The data was collected from the Report of Registration (October 1, 2019) from the Secretary of State.

Faithfulness - is the data trustworthy? The data was collected from a government source, meaning that as long as people self-reported their party affiliation correctly, the data should be reliable and credible.

Tracking the Sun

Structure - how is the data stored? All 29 datasets are parquet files. The data is organized in records/rows, where each one represents a solar panel and information on that solar panel (data provider, system id, installation date, installation price, and more). In the Ca_2020 dataset of Tracking the Sun, there are a total of 1,136,793 rows × 77 columns. Each column may contain different value types (such as strings, integers, floats, etc). Each Tracking The Sun dataset represents a particular state in 2020.

Granularity - how are the data aggregated (summed, averaged, etc)? Each record represents a solar panel and information on that solar panel (data provider, system id, installation date, installation price, and more). Some records contain missing or null values. At a quick glance, the missing values are represented as the integer -9999. We do not believe the data was aggregated or summarized in any shape or form.

Scope - how much time, how many people, what partial area? The dataset attempts to cover the total number of solar panels in a particular state for a particular year. The dataset also contains the installation date of the solar panels.

Temporality - how is time represented in the data? Date and time fields in the dataset are only used for the installation time of the solar panels. The timestamps are represented in this format: “2019-12-06 07:00:00”.

Faithfulness - is the data trustworthy? Despite the large number of null/missing values, we believe that the dataset is trustworthy in the sense that there are no unrealistic values. Although an abundance of null values is not ideal, it is good practice to leave them as placeholders for the data. It is important to note that the dataset derives from data reported to state agencies and utilities that administer PV incentive programs. The data was further collected and cleaned by Berkeley Lab.

US Zipcode to County State to FIPS

Structure - how is the data stored? The data is stored as a csv. The dataset contains 53,962 rows × 6 columns, where each row represents a zip code in the United States. There are six columns in the dataset (zip code, state county FIPS code, city, state, county name, and class code).

Granularity - how are the data aggregated (summed, averaged, etc)? Each record in the dataset represents a zip code in the United States. The data was collected through three sources (US HUD, Census Bureau, and USPS Zip to City Lookup). Overall, the raw data has not been summed, averaged, or grouped in any way.

Scope - how much time, how many people, what partial area? Each row in the dataset represents a zip code in the United States. Along with each zip code, it has information on the state county FIPS code, city, state, and county name that zip code has been assigned to.

Temporality - how is time represented in the data? Time is not represented in the data.

Faithfulness - is the data trustworthy? We believe the dataset is trustworthy, because the sources of the data are government entities (US HUD, Census Bureau, USPS). However, it is important to note that this public dataset was created by an individual without any reputable affiliations.

Supporting Code for the SGSTF of the Data

1.4.3 Question 3

Target Variable: Number of solar systems for a county (state county FIPS code region)

Features (from DeepSolar):

Education level (bachelor, doctoral, less than high school) - For education level X, number of X-level people (as highest degree) after 25 years old

Heating fuel (coal, oil, kerosene) - For heating fuel type X, number of house units using X as heating fuel

Per Capita Income - per capita annual income in dollars

Poverty Level - number of families below the poverty level

Race (Black, White, Asian, Indian, Islander, Other) - For race X, number of people who identify as race X in region

Employment/Unemployment - Number of employed/unemployed people in the region

Median household income - median annual household income in dollars

Education High School Graduation/Dropout Rate - The high school graduation and dropout rate for a FIPS code

Average Household Size - The average household size for a FIPS code

Housing Unit Median Gross Rent - Median housing unit gross rent for FIPS code region in dollars

Earth temperature - Earth temperature (celsius) of FIPS code region

Age rate - ratio of people with ages between X and Y, for age ranges 5, 10, 15, 18, 24, 34, 44, 54, 65, 75, 85.

Occupation Rate - Ratio of people with occupation X, with X being {education, finance, retail, wholesale, etc.}

Mortgage with Rate - ratio of housing units with mortgage in FIPS code region

Transportation rate - ratio of walking/carpooling/driving to work in the FIPS code region

Travel time rate - ratio of people with a given travel time to work (e.g., 20-29 minutes)

Voting Percentage for Democrats, Republicans - Democrat/Republican voting percentage in 2012 and 2016 elections in FIPS code

Features (from SacBee):

Registered - total number of registered voters in the region

Dem - total number of democrats in the region

Rep - total number of republicans in the region

NPP - total number of no party preferences in the region

OTH - total number of other party preferences in the region

Democrats per Republican - Dem/Rep or total number of democrats/total number of republicans

Features (from Tracking the Sun):

total_installed_price - The total installation price for solar panels in the region

rebate_or_grant - Solar rebates or grants cash value, in dollars, for region

customer_segment - One-hot encoded indicator of which sector(s) installed solar panels in the region (example: COM - commercial, RES - residential, etc)
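The one-hot encoding of customer_segment can be sketched with pandas; the miniature table and FIPS values below are illustrative stand-ins, not the real dataset:

```python
import pandas as pd

# Hypothetical Tracking the Sun rows (the real dataset has 77 columns).
tts = pd.DataFrame({
    "fips": ["06001", "06001", "06075"],
    "customer_segment": ["RES", "COM", "RES"],
})

# One-hot encode the sector, then take the max per county, so a county
# gets a 1 for every sector that installed panels there.
dummies = pd.get_dummies(tts["customer_segment"], prefix="segment")
encoded = pd.concat([tts[["fips"]], dummies], axis=1)
per_county = encoded.groupby("fips").max().reset_index()
```

Aggregating with `max` (rather than `sum`) keeps the columns as indicators, which matches how the feature is described above.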

1.5 Data Cleaning (10 points)

DeepSolar:

In order to clean the data, we need to remove all the null values in the DeepSolar dataset. We start with 72,537 rows; after removing all the NA values, 54,099 rows remain. To ensure that X contains only features, we remove columns that are closely related to solar_system_count. We set Y to solar_system_count, the quantity we want to predict.
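These cleaning steps can be sketched as follows; the toy table and column subset are illustrative stand-ins for the real 169-column DeepSolar csv:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the DeepSolar csv.
deep_solar = pd.DataFrame({
    "fips": [6001, 6075, 6085],
    "average_household_income": [85000.0, np.nan, 92000.0],
    "tile_count": [120, 40, 310],
    "solar_system_count": [95, 30, 250],
})

# Drop every row containing a null value, as described above.
clean = deep_solar.dropna()

# Remove columns closely related to the target before building X.
leaky = ["tile_count", "solar_system_count"]
X = clean.drop(columns=leaky + ["fips"])
y = clean["solar_system_count"]
```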

Tracking The Sun:

In this dataset, there were many null values and -9999 sentinel values (both string and integer), so we had to remove those before modeling. We also had to add a leading zero to the FIPS codes to ensure the state portion of each code was accurate. This dataset also contained a zip_code column, which was cleaned to accurately represent zip codes.
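A minimal sketch of the sentinel removal and zero-padding, using hypothetical column values:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature Tracking the Sun table.
tts = pd.DataFrame({
    "zip_code": [94704, 90001, 2139],
    "total_installed_price": [21000.0, -9999, 18000.0],
    "rebate_or_grant": ["1500", "-9999", "2000"],
})

# Treat the -9999 sentinel (string or integer) as missing, then drop.
tts = tts.replace([-9999, "-9999"], np.nan).dropna()

# Zero-pad the zip codes (and, analogously, the FIPS codes) so leading
# zeros lost during integer parsing are restored.
tts["zip_code"] = tts["zip_code"].astype(str).str.zfill(5)
```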

See attached comments for further information.

In order to merge deep_solar and tracking_the_sun, we must merge on the state-county FIPS code. In the code below, we create a column in deep_solar containing the state-county FIPS code, which is the first five digits of the original FIPS code in the deep_solar dataset. We also add a column called "ZIP (str)" to zip_county_fips, because the original zip codes were integers; we performed this operation to allow for future use and compatibility with the other datasets.

In order to merge deep_solar and tracking_the_sun, we must convert the zip codes from tracking_the_sun to state-county FIPS codes. Through the function below, we utilize the zip codes and state county FIPS codes from zip_county_fips to convert zip codes from tracking_the_sun to state-county FIPS codes.
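The merge pipeline described above might look like the following sketch; the miniature tables and the STCOUNTYFP column name are illustrative assumptions:

```python
import pandas as pd

# Hypothetical miniature versions of the three tables involved.
deep_solar = pd.DataFrame({
    "fips": ["060010001", "060750002"],
    "solar_system_count": [95, 250],
})
zip_county_fips = pd.DataFrame({
    "ZIP": [94704, 94110],
    "STCOUNTYFP": ["06001", "06075"],
})
tracking_the_sun = pd.DataFrame({
    "zip_code": ["94704", "94110"],
    "total_installed_price": [21000.0, 18000.0],
})

# State-county FIPS = first five digits of the full DeepSolar FIPS code.
deep_solar["state_county_fips"] = deep_solar["fips"].str[:5]

# String version of the zip codes for merging with the other datasets.
zip_county_fips["ZIP (str)"] = zip_county_fips["ZIP"].astype(str).str.zfill(5)

# Convert Tracking the Sun zip codes to state-county FIPS via the lookup,
# then merge with DeepSolar on the shared state-county FIPS code.
tts_fips = tracking_the_sun.merge(
    zip_county_fips[["ZIP (str)", "STCOUNTYFP"]],
    left_on="zip_code", right_on="ZIP (str)", how="left",
)
merged = deep_solar.merge(tts_fips, left_on="state_county_fips", right_on="STCOUNTYFP")
```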

SacBee:

We grabbed the county name from SacBee and found its respective state-county FIPS code from zip_county_fips. We did this to merge SacBee with DeepSolar and Tracking The Sun. There were no other null values, so further data cleaning was not necessary.

1.6 Data Summary and Exploratory Data Analysis (10 points)

DeepSolar: After making scatterplots of the target against each feature, we noticed that many of the features are only loosely related to the target variable.

SacBee: Based on the visualizations below, most counties in SacBee are majority Democratic. The median percentage for Democrats is around 42 percent, and for Republicans it is around 25 percent.

Tracking The Sun: Based on the graph, most of the rebates or grants are typically less than around $20,000, with a few exceptions.

US Zipcode to County State to FIPS: No EDA was needed to be done for this dataset, because we only used it to convert zip codes and county names to state county FIPS codes. All the variables in this dataset were also categorical variables.

1.7 Forecasting and Prediction Modeling (25 points)

We started by importing all the datasets we needed and removing the null values. When we created X and y, we made sure to remove columns that directly give us the answer, such as tile_count, solar_system_count, etc. Before training, we standardized the columns using StandardScaler(). We fit each model on the training set and evaluate how it performs on the test data. We follow this process for both lasso and ridge regression.
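A minimal sketch of this pipeline on synthetic data (the real features come from the merged county-level datasets):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned county-level feature matrix.
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize on the training split only, then apply to the test split.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Fit on the training set, score (R-squared) on the held-out test set.
lasso = Lasso(alpha=1.0).fit(X_train_s, y_train)
ridge = Ridge(alpha=1.0).fit(X_train_s, y_train)
lasso_r2 = lasso.score(X_test_s, y_test)
ridge_r2 = ridge.score(X_test_s, y_test)
```

Fitting the scaler on the training split alone avoids leaking test-set statistics into the model.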

For the neural network model, we used the Keras deep learning library to make a Sequential() model with one hidden layer of 4 neurons and an output layer with one neuron. We trained for 1000 epochs with 'mean_squared_error' as the loss function and 'adam' as the optimizer.
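Keras may not be installed everywhere, so the sketch below reproduces the same architecture (one hidden layer of 4 neurons, adam optimizer, squared-error loss) with scikit-learn's MLPRegressor as a stand-in; the synthetic data is illustrative, not the project's real feature matrix:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged county-level data.
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)

# One hidden layer with 4 neurons, adam optimizer, squared-error loss,
# mirroring the Sequential() model described above.
net = MLPRegressor(hidden_layer_sizes=(4,), solver="adam",
                   max_iter=1000, random_state=0)
net.fit(scaler.transform(X_train), y_train)
r2 = net.score(scaler.transform(X_test), y_test)
```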

1.7.1 Prediction Problem #1 (Using DeepSolar and Tracking The Sun)

1.7.1.a) Lasso Model

1.7.1.b) Ridge Model

1.7.1.c) Neural Networks

1.7.1.d) Hyperparameter Tuning for Lasso and Ridge Using KFold Cross Validation

Here, we use 5-fold cross-validation to find the optimal alpha (regularization strength) values for our Lasso and Ridge models above.
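This tuning step can be sketched with scikit-learn's GridSearchCV; the alpha grid and synthetic data here are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the standardized feature matrix.
X, y = make_regression(n_samples=300, n_features=15, noise=5.0, random_state=0)

# Candidate regularization strengths spanning several orders of magnitude.
param_grid = {"alpha": np.logspace(-3, 2, 20)}

# 5-fold cross-validation over the alpha grid for each model.
lasso_cv = GridSearchCV(Lasso(max_iter=10000), param_grid, cv=5).fit(X, y)
ridge_cv = GridSearchCV(Ridge(), param_grid, cv=5).fit(X, y)
best_lasso_alpha = lasso_cv.best_params_["alpha"]
best_ridge_alpha = ridge_cv.best_params_["alpha"]
```

After cross-validation, the models would be refit on the full training set with these best alpha values.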

1.7.1.e) Lasso and Ridge Models (After finding optimal alpha hyperparameters)

1.7.2 Prediction Problem #2 (Using DeepSolar, Tracking The Sun, and SacBee)

1.7.2.a) Lasso Model

1.7.2.b) Ridge Model

1.7.2.c) Neural Networks

1.7.2.d) Hyperparameter Tuning for Lasso and Ridge Using KFold Cross Validation

1.7.2.e) Lasso and Ridge Models (After finding optimal alpha hyperparameters)

1.7.3 Predicting the Number of Solar Panels for Unseen County Data In Alabama

In order to form predictions for unseen data, the first step was removing Alabama counties from the dataset. In our dataset, there were only two such counties, '01073' and '01097', which were both removed. We then extracted the features for these unseen counties and fed them into our three prediction models: Lasso, Ridge, and Neural Networks.
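A minimal sketch of the hold-out procedure, using a toy table and a single hypothetical feature in place of the full feature set:

```python
import pandas as pd
from sklearn.linear_model import Ridge

# Toy merged table; '01073' and '01097' are the two Alabama counties.
data = pd.DataFrame({
    "state_county_fips": ["01073", "01097", "06001", "06075", "06085"],
    "per_capita_income": [28000, 26000, 45000, 52000, 60000],
    "solar_system_count": [15, 4, 95, 250, 310],
})

# Remove the Alabama counties from training; keep them as unseen data.
unseen = data[data["state_county_fips"].isin(["01073", "01097"])]
train = data.drop(unseen.index)

# Fit on the remaining counties, then predict for the unseen ones.
features = ["per_capita_income"]
model = Ridge().fit(train[features], train["solar_system_count"])
predictions = model.predict(unseen[features])
```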

Interpretation of Results for Unseen Data: Since neural networks achieved the highest test accuracy of our three models on the seen data, we feel most confident in the neural network's predictions of the number of solar panels for the unseen counties. The neural network predicted 15.07 and 3.5 solar panel systems for the two Alabama counties ('01073' and '01097'), respectively. In terms of the resource allocation problem of rebates and grants, governments should be wary of extending the model to places where data regarding solar panels does not exist, since they risk lower model accuracy.

1.8 Interpretation and Conclusions (20 points)

We also wanted to make visualizations in the form of a heat map and a Choropleth map to capture the density of solar panels across U.S. counties. To do this, we relied upon the Folium Python package. We used .JSON files of U.S. states and counties to identify the locations on a latitude- and longitude-based map, then inserted the predicted solar panel values for each of these counties.

1.8.1 Folium Heat Map For Visualizing Predicted Solar Panel Density Across U.S.A.

1.8.2 Choropleth Map For Visualizing Predicted Solar Panel Density Across U.S.A.

Important note: To keep the color scheme within an appropriate range, we removed data for certain counties in California, including a vast majority from the Bay Area and Los Angeles. This removal gave the Choropleth map a better color range.

1.8.3 Map Interpretations

After visualizing the results with a heatmap, we can see that we are missing a lot of data from the area of the U.S. around Montana and Wyoming. California is among the states with the most solar panels. From the Choropleth map, we can see that counties on the West Coast of the U.S. tend to have higher densities of solar panels.

1.8.4 Conclusion and Final Thoughts

1.8.4a) Question 1:

An important resource allocation problem that our model helps solve is how to allocate rebates and grants related to solar. As we’ve explained in previous sections, Biden released a Clean Electricity Performance Program, which will invest 150 billion dollars into reducing emissions, creating new jobs, and growing the economy. An aspect of this program is rebates and grants, which are ways of enticing consumers and businesses into getting interested in solar.

Our advice to the government is to use our model to get a sense of which areas don’t have many solar panels (i.e., areas to target with rebates and grants) and which areas have significant numbers of solar panels (i.e., areas that likely don’t need to be targeted). This would allow the government to allocate its rebates and grants more efficiently, because there are communities with high solar potential (i.e., if they had solar, they could save a significant amount on energy) but little knowledge about solar panels. Since few people in such a community have a background in solar, they end up not investing in it. By targeting such a region with rebates and grants, the government would not only increase knowledge of solar panels in the area, but could potentially increase the number of installed solar panels as well. The issue, however, is that the government needs a way to identify such communities. Our model provides a systematic way of identifying them with respect to features such as average household income and poverty level. Thus, readers should care about our results because they have many implications for how rebates and grants are allocated.

A second resource allocation problem that our model solves is how solar companies allocate their advertisement campaigns. Depending on how likely a region is to support solar panel installation, a solar panel company may or may not want to target it for a campaign. Thus, our model gives solar panel companies a gauge of how popular solar is in a county, because counties with higher solar panel counts will probably react positively to solar panel advertisements. Our recommendation to solar panel companies is to direct attention to areas with high densities of solar panels (e.g., Los Angeles, California) so that they can maximize their profit where solar is popular. On the flip side, they can also target regions that do not have many solar panels to sway those areas toward getting solar panels. There could be more programs to promote education on how these rebates work and why installing solar panels will benefit not just the household's electric bill, but also the environment. This can help produce cleaner energy in these regions and help with our global goal of reaching zero carbon emissions by 2050. All in all, readers should care about these results because our model is much simpler than Stanford’s DeepSolar algorithm and can be interpreted easily using our choropleth map.

1.8.4b) Question 2:

Since we changed the granularity from full FIPS codes to state-county FIPS codes, some important data may have been lost in the change. We fit linear models for LASSO and RIDGE when, in reality, the relationship between many of these features and the true solar system count may be nonlinear. Consistent with this, the neural net performed the best of our models, since it can model nonlinear relationships, unlike LASSO and RIDGE. With more time, we would tune the parameters of the neural net to see if it could produce even better results, as it appears to be slightly overfitting the current data.

One reason that our model may be flawed is that we assume we have data on all counties. In the Choropleth graph, we realized that many counties lack data that we would need to predict solar panels. Based on the graph, these are mostly rural counties that may not have the same access to technology as other counties. Since these counties may lack the technology, it is hard to collect the feature data needed to produce an accurate model. Since our model uses 173 columns, there are many features we would need to collect from these communities to get a full picture of the number of solar panels in the United States.

In terms of prediction problems, the tendency of poor communities to lack access to technology skews our data in a certain direction. For example, some communities may show low education levels or low rebate uptake simply because the programs are not widely known there. These regions were relatively unknown to us in terms of both the feature data and our prediction data.

This puts our model in a position to overestimate how many solar panels a county has. Since some counties lack access to technology, those communities are unlikely to have many solar panels, yet most of our data comes from counties with more access to technology, which are more likely to have solar panels. Therefore, our model will likely overestimate the number of solar panels in underrepresented regions.

Another important reason our results may be flawed is that we used DeepSolar’s neural net predictions of solar panel counts. Since DeepSolar likely has its own errors, we may be training on those errors. We decided to use DeepSolar because of the number of features it offers, and we added features of our own (Tracking the Sun and SacBee) to see if we could produce more accurate predictions. Furthermore, we could not determine which values were true solar counts and which were predicted by DeepSolar; they were merged together, which means we do not know the true error of our model.